Abstract

We consider an approximation scheme for solving Markov Decision Processes (MDPs) with
countable state space, finite action space, and bounded rewards that uses an approximate
solution of a fixed finite-horizon sub-MDP of a given infinite-horizon MDP to create a stationary
policy, which we call “approximate receding horizon control”. We first analyze the performance
of the approximate receding horizon control for infinite-horizon average reward under an
ergodicity assumption, which also generalizes the result obtained by White [36]. We then study two
examples of the approximate receding horizon control via lower bounds to the exact solution to
the sub-MDP. The first control policy is based on a finite-horizon approximation of Howard’s
policy improvement of a single policy and the second policy is based on a generalization of the
single policy improvement for multiple policies. Along the way, we also provide a simple
alternative proof of the policy improvement principle for countable state space. We finally discuss
practical implementations of these schemes via simulation.

Keywords:  Markov  decision  process,  receding  horizon  control,  infinite  horizon  average  reward, 
policy  improvement,  rollout,  ergodicity 
1  Introduction

We consider an approximation scheme for solving Markov Decision Processes (MDPs) with countable
state space, finite action space, and bounded rewards. The scheme, which we call “approximate
receding horizon control”, uses an approximate solution of a fixed finite-horizon sub-MDP of a given
infinite-horizon MDP to create a stationary policy to solve the infinite-horizon MDP.

The idea of receding horizon control has been applied to many interesting problems in various
contexts to solve the problems in an “on-line” manner, where in this case we obtain an optimal
exact solution with respect to a “small” moving horizon at each decision time and apply the
solution to the system. For example, it has been applied to planning problems (e.g., inventory
control) that can be modeled as linear programs [14] and that can be represented as a shortest path
problem in an acyclic network (see [13] for example problems and references therein), a routing
problem in a communication network by formulating the problem as a nonlinear optimal control
problem [2], dynamic games [8], aircraft tracking [31], the stabilization of nonlinear time-varying
systems [21, 26, 28] in the model predictive control literature, and macroplanning in economics [20].
The intuition behind the approach is that if the horizon is “long” enough to obtain a stationary
behavior of the system, the moving horizon control would have good performance. Indeed, for
MDPs, Hernandez-Lerma and Lasserre [16] showed that the value of the receding horizon control
converges geometrically to the optimal value, uniformly in the initial state, as the length of the
moving horizon increases. For the infinite-horizon discounted reward case, it converges geometrically
fast with a given discounting factor in (0,1), and for the infinite-horizon average reward case, it
converges geometrically fast with a given “ergodicity coefficient” in (0,1). Furthermore, it has been
shown that there always exists a minimal finite horizon $H$ such that the receding $H$-horizon control
prescribes exactly the same action as the policy that achieves the optimal infinite-horizon rewards
at every state (see [6] for the discounted case, and [17] for the average reward case with ergodicity
assumptions).

Unfortunately, a large state-space size makes it almost impossible to solve the MDPs in practice
even with a relatively small receding horizon. Motivated by this, we first analyze the performance
of the approximate receding horizon control for the infinite-horizon average reward. The analysis
also generalizes the result obtained by White [36] for finite state space with a unichain assumption.
We show that the loss in infinite-horizon average reward incurred by following the approximate
receding horizon control is bounded by two terms: the error due to the finite-horizon approximation,
which approaches zero geometrically fast with a given ergodicity coefficient, and the error due to the
approximation of the optimal finite-horizon value. Hence, if the receding horizon is “long” enough
and the approximation of the optimal finite-horizon value is good, the performance bound will be
relatively small.

We then study two examples of approximate receding horizon control via lower bounds to the
exact solution of the sub-MDP problem of the given infinite-horizon MDP, where both examples
can be implemented easily by Monte-Carlo simulation. The first control policy is based on a
finite-horizon approximation of Howard’s policy improvement of a single policy and the second policy
is based on a generalization of the single policy improvement for multiple policies. In the study
of the first policy, we provide a simple alternative proof of the policy improvement principle for
countable state space, which is rather cumbersome to prove (see, e.g., Chapter 7 in [12] for a
proof via the vanishing discount approach for finite state space or [27] for general state space).
The Monte-Carlo simulation implementation of the first policy is an extension of an “on-line”
simulation method, called “rollout”, proposed by Bertsekas and Castanon [4] to solve MDPs for
the total reward criterion.

The rollout approach is promising if we have a good base policy. Indeed, several recent works
reported successful results in this direction (see Subsection 4.1 of the present paper for a brief
survey). Suppose we have multiple base policies available instead of a single base policy. Because
we cannot easily predict each policy’s performance in advance, it is difficult to select which policy
to roll out, i.e., which policy to improve upon. Furthermore, it is often true that the available policies
are distinct in that each policy performs well on different sample paths, in which case one wishes
to combine the multiple available policies to create a single control policy. To this end, we consider
a generalization of the single policy improvement for multiple policies and study its properties. One
of the properties of the generalized policy improvement principle is that if there exists a “best”
policy that achieves both the best bias and the best gain among the multiple policies, the generalized
policy improvement method improves the infinite-horizon average reward of the best policy in the
set. As in the rollout policy case, we also approximate the generalized policy improvement principle
in a finite-horizon sense, generalizing the rollout policy. We call the resulting policy “parallel
rollout”. We analyze the performances of the two example policies relative to the policies being
rolled out within the framework of the approximate receding horizon approach.

All of the analysis in this paper is based on an “ergodicity” assumption on a given MDP as in the
work of Hernandez-Lerma and Lasserre [16]. This assumption allows us to discuss the relationship
between the length of the receding horizon and the performance of the approximate receding horizon
approach. We note that analysis along this line for the cases of infinite-horizon discounted
reward and total reward is reported in [10].

This  paper  is  organized  as  follows.  In  Section  2,  we  formally  introduce  Markov  decision  processes 
and  in  Section  3,  we  define  the  (approximate)  receding  horizon  control  and  analyze  its  performance. 
We  then  provide  two  examples  of  the  approximate  receding  horizon  control  and  analyze  their 
performances  in  Section  4.  We  conclude  the  present  paper  in  Section  5. 
2  Markov  Decision  Process

In this section, we present the essentials of the MDPs we consider and the properties we use in the
present paper. For a more substantial introduction, see Puterman’s book [33] or the survey paper
by Arapostathis et al. [1]. We consider an MDP with a countable state set $X$, a finite action set $A$,
a nonnegative and bounded reward function $R$ such that $R : X \times A \to \mathbb{R}_+$, and a state
transition function $P$ that maps the state and action pair to a probability distribution over $X$. We
will denote the probability of transitioning to state $y \in X$ from state $x \in X$ by taking an action
$a \in A$ at $x$ as $p(y|x,a)$. For simplicity, we assume that every action is admissible at each state.

Define a stationary policy $\pi$ as a function $\pi : X \to A$ and denote $\Pi$ as the set of all possible
stationary policies. Given an initial state $x$, we define the infinite-horizon average reward of following
a policy $\pi \in \Pi$ as

$$J^\pi_\infty(x) := \liminf_{H \to \infty} \frac{1}{H} E\left[ \sum_{t=0}^{H-1} R(x_t, \pi(x_t)) \,\Big|\, x_0 = x \right], \tag{1}$$

where $x_t$ is a random variable denoting the state at time $t$ following the policy $\pi$ and we use the
subscript $\infty$ to emphasize the infinite horizon. We seek an optimal policy that achieves

$$J^*_\infty(x) = \sup_{\pi \in \Pi} J^\pi_\infty(x), \quad x \in X.$$

Because there might not always exist such an optimal policy in $\Pi$ that achieves $J^*_\infty(x)$ [1, 12, 33],
we impose an ergodicity assumption throughout the present paper (see, e.g., page 56 in [18] for
stronger assumptions) stated as follows:

Assumption 2.1 Define $K := \{(x,a) \mid x \in X, a \in A\}$ and $p(y|k) := p(y|x,a)$ for all $k = (x,a) \in K$.
There exists a positive number $\alpha < 1$ such that

$$\sup_{k, k' \in K} \sum_{y \in X} |p(y|k) - p(y|k')| \le 2\alpha.$$
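For a finite MDP the quantity in Assumption 2.1 can be evaluated directly. The sketch below is only an illustration: the 3-state, 2-action transition table `P` and the function name `ergodicity_coefficient` are hypothetical and not taken from the paper. It returns the smallest $\alpha$ for which the inequality holds, so the assumption is satisfied whenever the result is below 1.

```python
from itertools import combinations

def ergodicity_coefficient(P):
    """Half of the largest L1 distance between two transition rows.

    P[a][x] is the next-state distribution p(.|x, a), so every pair
    k = (x, a) contributes one row. The result is the smallest alpha with
    sum_y |p(y|k) - p(y|k')| <= 2 * alpha for all pairs k, k'.
    """
    rows = [row for rows_for_action in P for row in rows_for_action]
    worst = max(sum(abs(p - q) for p, q in zip(r1, r2))
                for r1, r2 in combinations(rows, 2))
    return worst / 2.0

# Hypothetical 3-state, 2-action MDP used only for illustration.
P = [
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],  # action 0
    [[0.1, 0.6, 0.3], [0.4, 0.2, 0.4], [0.2, 0.2, 0.6]],  # action 1
]
alpha = ergodicity_coefficient(P)
print("Assumption 2.1 satisfied:", alpha < 1.0)
```

For a countable state space the supremum cannot be enumerated this way; the sketch applies only to finite examples.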

The above ergodicity assumption implies that there always exists an optimal policy $\pi^*$ in $\Pi$,
and that for any policy $\pi \in \Pi$, $J^\pi_\infty(x)$ is independent of the starting state $x$, from which we write
$J^\pi_\infty$, omitting $x$, and that there exists a bounded measurable function $h^\pi$ on $X$ and a constant
$J^\pi_\infty$ such that for all $x \in X$,

$$J^\pi_\infty + h^\pi(x) = R(x, \pi(x)) + \sum_{y \in X} p(y|x, \pi(x)) h^\pi(y). \tag{2}$$

We refer to Eq. (2) as Poisson’s equation with respect to $\pi$.

Let $B(X)$ be the space of real-valued bounded measurable functions on $X$ endowed with the
supremum norm $\|V\| = \sup_x |V(x)|$ for $V \in B(X)$. We define an operator $T : B(X) \to B(X)$ as

$$T(V)(x) = \max_{a \in A} \Big\{ R(x,a) + \sum_{y \in X} p(y|x,a) V(y) \Big\}, \quad V \in B(X), \; x \in X, \tag{3}$$

and let $\{V^*_n\}$ be the sequence of value iteration functions $V^*_n := T(V^*_{n-1})$ where $n = 1, 2, \ldots$ and we
assume that $V^*_0(x) = 0$ for all $x \in X$. $V^*_n$ might not converge to a function in $B(X)$ as $n \to \infty$.
However, an appropriate transformation of $V^*_n$ does converge. We state this fact by the following
theorem (see Theorem 4.8 (a) in [18]).


Theorem 2.1 Assume that Assumption 2.1 holds. For all $n \ge 0$,

$$-\frac{\|R\|}{1-\alpha}\,\alpha^n \le \inf_{x} |V^*_{n+1}(x) - V^*_n(x)| - J^*_\infty \le \sup_{x} |V^*_{n+1}(x) - V^*_n(x)| - J^*_\infty \le \frac{\|R\|}{1-\alpha}\,\alpha^n.$$
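The bracketing in Theorem 2.1 can be observed numerically on a small finite MDP. The following is a minimal sketch under stated assumptions: the tables `P` and `R` are a hypothetical 3-state, 2-action example (with ergodicity coefficient 0.4 and $\|R\| = 2$), and the function names are illustrative. It records the smallest and largest one-step differences of the value iteration functions; both squeeze onto the optimal average reward as $n$ grows.

```python
def bellman(V, P, R):
    """One application of the operator T of Eq. (3)."""
    n_states, n_actions = len(R), len(R[0])
    return [max(R[x][a] + sum(P[a][x][y] * V[y] for y in range(n_states))
                for a in range(n_actions))
            for x in range(n_states)]

def difference_brackets(P, R, n_iter):
    """Return [(inf_x d_n(x), sup_x d_n(x))] for d_n = V*_{n+1} - V*_n, V*_0 = 0."""
    V = [0.0] * len(R)
    brackets = []
    for _ in range(n_iter):
        V_next = bellman(V, P, R)
        diffs = [v_next - v for v_next, v in zip(V_next, V)]
        brackets.append((min(diffs), max(diffs)))
        V = V_next
    return brackets

# Hypothetical MDP: P[a][x][y] = p(y|x, a), R[x][a] = reward.
P = [
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
    [[0.1, 0.6, 0.3], [0.4, 0.2, 0.4], [0.2, 0.2, 0.6]],
]
R = [[1.0, 0.0], [0.0, 2.0], [0.5, 1.0]]

brackets = difference_brackets(P, R, 20)
for n in (0, 5, 19):
    lo, hi = brackets[n]
    print(n, lo, hi)  # the interval [lo, hi] contains the optimal gain and shrinks
```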


Let us define an operator $T_\pi : B(X) \to B(X)$ for $\pi \in \Pi$ as

$$T_\pi(V)(x) = R(x, \pi(x)) + \sum_{y \in X} p(y|x, \pi(x)) V(y), \quad V \in B(X), \; x \in X,$$

and let $\{V^\pi_n\}$ be the sequence of value iteration functions with respect to $\pi$, $V^\pi_n := T_\pi(V^\pi_{n-1})$ where
$n = 1, 2, \ldots$ and $V^\pi_0(x) = 0$ for all $x \in X$. We can see that $V^\pi_n$ is the total reward over a horizon of
length $n$ following the policy $\pi$, i.e., $V^\pi_n(x) = E\left[ \sum_{t=0}^{n-1} R(x_t, \pi(x_t)) \mid x_0 = x \right]$. The above theorem
immediately implies the following corollary:


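Corollary 2.1 Assume that Assumption 2.1 holds. For all $n \ge 0$, any $\pi \in \Pi$, and all $x \in X$,

$$-\frac{\|R\|}{1-\alpha}\,\alpha^n \le V^\pi_{n+1}(x) - V^\pi_n(x) - J^\pi_\infty \le \frac{\|R\|}{1-\alpha}\,\alpha^n.$$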


3  Receding  Horizon  Control 


We define the receding $H$-horizon control policy $\pi_H \in \Pi$ with $H < \infty$ as a policy that satisfies for
all $x \in X$,

$$T(V^*_{H-1})(x) = T_{\pi_H}(V^*_{H-1})(x).$$

It has been shown that there always exists a minimal finite horizon $H$ such that $\pi_H(x) = \pi^*(x)$
for all $x \in X$, where $\pi^*$ is the policy that achieves $\sup_{\pi \in \Pi} J^\pi_\infty$ under Assumption 2.1 [17]. In
addition to the existence of such a finite horizon, the paper [17] provides an algorithm (stopping
rule) to detect such a horizon in a finite number of time steps. Furthermore, Hernandez-Lerma and
Lasserre [16] showed that

$$0 \le J^*_\infty - J^{\pi_H}_\infty \le \frac{\|R\|}{1-\alpha}\,\alpha^{H-1},$$
where we can see that the performance of the receding horizon control policy provides a good
approximation for the optimal infinite-horizon average reward and the error approaches zero
geometrically with $\alpha$. Unfortunately, obtaining the true $H$-horizon optimal value is often difficult,
e.g., due to the large state space (see, e.g., [29] for a discussion of the complexity of solving
finite-horizon MDPs). Motivated by this, we study an approximate receding horizon control that uses
an approximate value function as an approximate solution of $V^*_{H-1}$ for some $H < \infty$.
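When the state space is small enough to enumerate, the exact receding $H$-horizon policy can be computed by applying the operator $T$ exactly $H-1$ times and then acting greedily. The sketch below assumes a hypothetical finite MDP stored as nested lists; the names `q_values` and `receding_horizon_policy` are illustrative only, and ties are broken toward the lowest action index.

```python
def q_values(V, P, R, x):
    """R(x, a) + sum_y p(y|x, a) V(y) for every action a."""
    n_states = len(R)
    return [R[x][a] + sum(P[a][x][y] * V[y] for y in range(n_states))
            for a in range(len(R[0]))]

def receding_horizon_policy(P, R, H):
    """Greedy policy with respect to V*_{H-1}, starting from V*_0 = 0."""
    n_states = len(R)
    V = [0.0] * n_states
    for _ in range(H - 1):
        V = [max(q_values(V, P, R, x)) for x in range(n_states)]
    policy = []
    for x in range(n_states):
        q = q_values(V, P, R, x)
        policy.append(q.index(max(q)))
    return policy

# Hypothetical MDP: P[a][x][y] = p(y|x, a), R[x][a] = reward.
P = [
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
    [[0.1, 0.6, 0.3], [0.4, 0.2, 0.4], [0.2, 0.2, 0.6]],
]
R = [[1.0, 0.0], [0.0, 2.0], [0.5, 1.0]]

myopic = receding_horizon_policy(P, R, 1)    # H = 1 looks only at immediate reward
longer = receding_horizon_policy(P, R, 10)
print(myopic, longer)
```

The cost of each sweep grows with the number of states, which is the practical obstacle that motivates replacing $V^*_{H-1}$ by an approximation.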


3.1  Analysis  of  approximate  receding  horizon  control 

We  start  with  a  general  result  that  we  will  use  throughout  the  present  paper. 

Lemma 3.1 Assume that Assumption 2.1 holds. Given $V \in B(X)$, consider a policy $\pi_V \in \Pi$ such
that

$$T(V)(x) = T_{\pi_V}(V)(x) \quad \text{for all } x \in X.$$

Then a stationary distribution $P_{\pi_V}$ over $X$ exists and $J^{\pi_V}_\infty = \sum_{y \in X} [T(V)(y) - V(y)] P_{\pi_V}(y)$.

Proof: The proof here is similar to that on page 65 in [18]. We provide the proof of this
lemma for completeness. From Remark 3.2 (b) in [18], for any stationary policy $\pi \in \Pi$, there
is a unique stationary probability distribution $P_\pi$ satisfying the invariance property
$P_\pi(x) = \sum_{y \in X} p(x|y, \pi(y)) P_\pi(y)$. From the definition of $\pi_V$,

$$T(V)(x) = R(x, \pi_V(x)) + \sum_{y \in X} p(y|x, \pi_V(x)) V(y).$$

Now summing both sides with respect to the stationary distribution $P_{\pi_V}$,

$$\sum_{x \in X} T(V)(x) P_{\pi_V}(x) = \sum_{x \in X} R(x, \pi_V(x)) P_{\pi_V}(x) + \sum_{x \in X} \sum_{y \in X} p(y|x, \pi_V(x)) V(y) P_{\pi_V}(x).$$

The first term on the right side is equal to $J^{\pi_V}_\infty$ by Lemma 3.3 (b.ii) in [18], and the second term
on the right side is equal to $\sum_{y \in X} V(y) P_{\pi_V}(y)$ from the invariance property. Rearranging terms
yields the desired result. ■
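The identity in Lemma 3.1 is easy to check numerically on a finite example. The sketch below is an illustration under assumptions: the MDP tables and the test function `V` are hypothetical, and the stationary distribution is approximated by repeatedly applying the transition matrix of the greedy policy, which converges geometrically here because the example satisfies Assumption 2.1.

```python
def greedy_policy(V, P, R):
    """Return (pi_V, T(V)): the greedy policy for V and the values T(V)(x)."""
    n_states, n_actions = len(R), len(R[0])
    policy, TV = [], []
    for x in range(n_states):
        q = [R[x][a] + sum(P[a][x][y] * V[y] for y in range(n_states))
             for a in range(n_actions)]
        TV.append(max(q))
        policy.append(q.index(max(q)))
    return policy, TV

def stationary_distribution(P, policy, n_iter=200):
    """Approximate the invariant distribution of the chain induced by `policy`."""
    n = len(policy)
    mu = [1.0 / n] * n
    for _ in range(n_iter):
        mu = [sum(mu[x] * P[policy[x]][x][y] for x in range(n)) for y in range(n)]
    return mu

# Hypothetical MDP and an arbitrary bounded function V.
P = [
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
    [[0.1, 0.6, 0.3], [0.4, 0.2, 0.4], [0.2, 0.2, 0.6]],
]
R = [[1.0, 0.0], [0.0, 2.0], [0.5, 1.0]]
V = [3.0, -1.0, 0.5]

policy, TV = greedy_policy(V, P, R)
mu = stationary_distribution(P, policy)
gain = sum(R[x][policy[x]] * mu[x] for x in range(len(R)))         # average reward of pi_V
lemma_rhs = sum((TV[y] - V[y]) * mu[y] for y in range(len(R)))     # right side of Lemma 3.1
print(abs(gain - lemma_rhs) < 1e-9)
```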

We define the approximate receding horizon control policy $\pi_V$ as a policy such that, for a given
$V \in B(X)$ with $|V^*_n(x) - V(x)| \le \epsilon$ for all $x \in X$ and some $n \ge 0$, it satisfies

$$T(V)(x) = T_{\pi_V}(V)(x), \quad x \in X.$$

We  now  state  and  prove  one  of  our  main  theorems. 


Theorem 3.1 Assume that Assumption 2.1 holds. Given $V \in B(X)$ such that for some $n \ge 0$,
$|V^*_n(x) - V(x)| \le \epsilon$ for all $x$ in $X$, consider a policy $\pi_V$ such that for all $x \in X$,
$T(V)(x) = T_{\pi_V}(V)(x)$. Then,

$$0 \le J^*_\infty - J^{\pi_V}_\infty \le \frac{\|R\|}{1-\alpha}\,\alpha^n + 2\epsilon.$$
Proof: We first prove that if $|V^*_n(x) - V(x)| \le \epsilon$ for all $x \in X$, then $|T(V^*_n)(x) - T(V)(x)| \le \epsilon$ for
all $x \in X$. This simply follows from

$$|T(V^*_n)(x) - T(V)(x)| \le \max_{a \in A} \Big| \sum_{y \in X} [V^*_n(y) - V(y)] p(y|x,a) \Big| \le \sup_{x} |V^*_n(x) - V(x)|,$$

where the first inequality follows from Hinderer’s proposition (page 123 in [18]). Therefore, we have
that

$$T(V^*_n)(x) - V^*_n(x) - 2\epsilon \le T(V)(x) - V(x) \le T(V^*_n)(x) - V^*_n(x) + 2\epsilon$$

for all $x \in X$ with simple algebra. Now by Lemma 3.1, $J^{\pi_V}_\infty = \sum_{y \in X} [T(V)(y) - V(y)] P_{\pi_V}(y)$. It
follows that

$$\begin{aligned}
J^{\pi_V}_\infty &= \sum_{y \in X} [T(V)(y) - V(y)] P_{\pi_V}(y) \ge \sum_{y \in X} [T(V^*_n)(y) - V^*_n(y)] P_{\pi_V}(y) - 2\epsilon \\
&\ge \inf_{x} |T(V^*_n)(x) - V^*_n(x)| - 2\epsilon \ge J^*_\infty - \frac{\|R\|}{1-\alpha}\,\alpha^n - 2\epsilon,
\end{aligned}$$

where the last inequality is from Theorem 2.1. Rearranging gives the upper bound on
$J^*_\infty - J^{\pi_V}_\infty$. The lower bound is trivial by the definition of $J^*_\infty$. ■


We remark that the loss in infinite-horizon average reward from following the approximate receding
horizon control via $V$ is bounded by a term due to the finite-horizon approximation and a term due to
the approximation by $V$, so that if the receding horizon is “long” enough and the approximation by
$V$ is good, the performance bound will be relatively small. If $\epsilon = 0$, the result coincides with the
one obtained in [16]. As $n \to \infty$, the error approaches $2\epsilon$, which coincides with the result obtained
by White [36] for finite state space with a unichain assumption. Furthermore, the above result can
be extended to general state space (Borel space) with appropriate measure-theoretic arguments.


4  Examples  of  Approximate  Receding  Horizon  Control 

4.1  Rollout — finite-horizon  approximation  of  Howard’s  policy  improvement 

Given a policy $\pi \in \Pi$, suppose we solved Poisson’s equation with respect to $\pi$ given by Eq. (2):

$$J^\pi_\infty + h^\pi(x) = T_\pi(h^\pi)(x), \quad x \in X,$$

obtaining a function $h^\pi$ and $J^\pi_\infty$ for $\pi$. Define a new policy $\bar{\pi}$ such that for all $x \in X$,

$$T_{\bar{\pi}}(h^\pi)(x) = T(h^\pi)(x).$$

That is, for all $x \in X$,

$$\bar{\pi}(x) \in \arg\max_{a \in A} \Big\{ R(x,a) + \sum_{y \in X} p(y|x,a) h^\pi(y) \Big\}. \tag{4}$$
It is well known that $\bar{\pi}$ improves $\pi$ in the sense that $J^{\bar{\pi}}_\infty \ge J^\pi_\infty$ under an appropriate condition for
a given MDP, and it is called Howard’s policy improvement. See, e.g., Chapter 7 in [12] for a proof
with a finite state space under an irreducibility condition via the vanishing discount approach.
For general state space, see, e.g., [27]. In general, the proof of the policy improvement for the
average reward case is not straightforward. We provide here a simple alternative proof under
Assumption 2.1:

$$\begin{aligned}
J^{\bar{\pi}}_\infty &= \sum_{x} [T(h^\pi)(x) - h^\pi(x)] P_{\bar{\pi}}(x) \quad \text{from Lemma 3.1} \\
&\ge \sum_{x} [T_\pi(h^\pi)(x) - h^\pi(x)] P_{\bar{\pi}}(x) \\
&= \sum_{x} [J^\pi_\infty + h^\pi(x) - h^\pi(x)] P_{\bar{\pi}}(x) \\
&= J^\pi_\infty.
\end{aligned}$$
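One exact improvement step of this kind can be sketched for a finite MDP as follows. This is an illustration under assumptions, not the paper's procedure: the MDP tables are hypothetical, and Poisson's equation is solved approximately by relative value iteration for the fixed policy (normalizing the solution to vanish at state 0), which is a substitute technique that converges for this example because it satisfies Assumption 2.1.

```python
def evaluate_policy(P, R, policy, n_iter=200):
    """Approximate (J, h) with J + h(x) = R(x, pi(x)) + sum_y p(y|x, pi(x)) h(y), h(0) = 0."""
    n = len(policy)
    h, J = [0.0] * n, 0.0
    for _ in range(n_iter):
        Th = [R[x][policy[x]] + sum(P[policy[x]][x][y] * h[y] for y in range(n))
              for x in range(n)]
        J = Th[0]                      # value at the reference state estimates the gain
        h = [v - J for v in Th]        # re-centre so that h(0) = 0
    return J, h

def improve(P, R, h):
    """Greedy policy with respect to h, as in Eq. (4)."""
    n, m = len(R), len(R[0])
    policy = []
    for x in range(n):
        q = [R[x][a] + sum(P[a][x][y] * h[y] for y in range(n)) for a in range(m)]
        policy.append(q.index(max(q)))
    return policy

# Hypothetical MDP and a deliberately poor base policy.
P = [
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
    [[0.1, 0.6, 0.3], [0.4, 0.2, 0.4], [0.2, 0.2, 0.6]],
]
R = [[1.0, 0.0], [0.0, 2.0], [0.5, 1.0]]
base = [1, 0, 0]

J_base, h_base = evaluate_policy(P, R, base)
improved = improve(P, R, h_base)
J_improved, _ = evaluate_policy(P, R, improved)
print(J_base, J_improved)  # the improved gain is at least the base gain
```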


Unfortunately, obtaining the $h^\pi$-function is often very difficult, so we consider an approximation
scheme that is implementable in practice via Monte-Carlo simulation. As a finite approximation
of the policy improvement scheme, we replace $h^\pi$ by the finite-horizon value function of the policy
$\pi$. The value of following the given policy $\pi$ can be simply estimated by a sample mean over a
set of sample paths generated by Monte-Carlo simulation. The very idea of simulating a given
(heuristic) policy to obtain an (approximately) improved policy originated from Tesauro’s work in
backgammon [35] and recently, Bertsekas and Castanon extended the idea into an on-line policy
improvement scheme called “rollout” to solve finite-horizon MDPs with total reward criterion. It is
an on-line scheme in the sense of “planning”. That is, at each decision time $t$, we roll out the given
base policy to estimate the utility (called the Q-value) of taking an initial action at state $x_t$ and take
the action with the highest utility, which effectively creates an improved policy of the base policy
in an on-line manner.

Formally, we define the $H$-horizon rollout policy $\pi_{ro,H} \in \Pi$ with a base policy $\pi$ and $H < \infty$ as
the policy:

$$\pi_{ro,H}(x) \in \arg\max_{a \in A} \Big\{ R(x,a) + \sum_{y \in X} p(y|x,a) V^\pi_{H-1}(y) \Big\}, \quad x \in X. \tag{5}$$


Note that $V^\pi_{H-1}(x)$ is a lower bound to $V^*_{H-1}(x)$ for all $x \in X$. From the result of
Theorem 3.1, if $V^\pi_{H-1}$ is a good approximation of $V^*_{H-1}$, the resulting performance will be close to that
of the true receding horizon control policy. Note also that the finite-horizon approximation of the
policy improvement does not use a function that approximates $h^\pi$ directly but we use an
approximation function for $V^*_{H-1}$. We will discuss this issue in the next subsection in more detail. The
question is how the $H$-horizon rollout policy performs relative to the policy $\pi$ that it rolls out in
terms of infinite-horizon average reward.
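The on-line rollout decision at a single state can be sketched with plain Monte-Carlo sampling. The code below is a minimal illustration under assumptions: the finite MDP tables, the base policy, and the names `simulate_value` and `rollout_action` are hypothetical, and the number of sample paths per action is a free parameter whose choice the text does not prescribe.

```python
import random

def simulate_value(P, R, policy, x, horizon, rng):
    """Total reward along one sampled path of length `horizon` under `policy` from x."""
    total = 0.0
    for _ in range(horizon):
        a = policy[x]
        total += R[x][a]
        x = rng.choices(range(len(R)), weights=P[a][x])[0]
    return total

def rollout_action(P, R, base_policy, x, H, n_samples, rng):
    """Sample-mean estimate of R(x,a) + E[V^pi_{H-1}(next state)]; return the best action."""
    n_states = len(R)
    best_action, best_q = None, float("-inf")
    for a in range(len(R[0])):
        q = 0.0
        for _ in range(n_samples):
            y = rng.choices(range(n_states), weights=P[a][x])[0]
            q += R[x][a] + simulate_value(P, R, base_policy, y, H - 1, rng)
        q /= n_samples
        if q > best_q:
            best_action, best_q = a, q
    return best_action

# Hypothetical MDP: P[a][x][y] = p(y|x, a), R[x][a] = reward.
P = [
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
    [[0.1, 0.6, 0.3], [0.4, 0.2, 0.4], [0.2, 0.2, 0.6]],
]
R = [[1.0, 0.0], [0.0, 2.0], [0.5, 1.0]]
base_policy = [0, 0, 0]
rng = random.Random(0)

action = rollout_action(P, R, base_policy, x=1, H=8, n_samples=200, rng=rng)
print(action)
```

Because the estimates are noisy, only the ranking of actions matters; more sample paths reduce the chance of a misranking at the cost of more simulation.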
Theorem 4.1 Assume that Assumption 2.1 holds. Consider the $H$-horizon rollout policy $\pi_{ro,H}$
with a base policy $\pi$ and $H < \infty$. Then

$$J^{\pi_{ro,H}}_\infty \ge J^\pi_\infty - \frac{\|R\|}{1-\alpha}\,\alpha^{H-1}.$$


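Proof:

$$\begin{aligned}
J^{\pi_{ro,H}}_\infty &= \sum_{x} [T(V^\pi_{H-1})(x) - V^\pi_{H-1}(x)] P_{\pi_{ro,H}}(x) \quad \text{from Lemma 3.1} \\
&\ge \inf_{x} \big( T(V^\pi_{H-1})(x) - V^\pi_{H-1}(x) \big) \\
&\ge \inf_{x} \big( V^\pi_H(x) - V^\pi_{H-1}(x) \big) \quad \text{by the definition of the } T\text{-operator} \\
&\ge J^\pi_\infty - \frac{\|R\|}{1-\alpha}\,\alpha^{H-1} \quad \text{from Corollary 2.1.}
\end{aligned}$$

■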


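The above result immediately allows us to obtain the value of $H$ to have a desired approximate
on-line policy improvement performance. That is, given any $\epsilon > 0$, if we let
$\alpha^{H-1} \cdot \frac{\|R\|}{1-\alpha} \le \epsilon$ so that
$H \ge 1 + \log_\alpha \frac{\epsilon(1-\alpha)}{\|R\|}$, then $J^{\pi_{ro,H}}_\infty \ge J^\pi_\infty - \epsilon$.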
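The horizon requirement is a one-line computation. The helper below is a small sketch (the function name and the sample values $\alpha = 0.4$, $\|R\| = 2$, $\epsilon = 0.01$ are hypothetical) that returns the smallest integer $H$ for which the error term of Theorem 4.1 is at most $\epsilon$.

```python
import math

def min_rollout_horizon(alpha, r_norm, eps):
    """Smallest integer H >= 1 with alpha**(H-1) * r_norm / (1 - alpha) <= eps."""
    if not 0.0 < alpha < 1.0 or r_norm <= 0.0 or eps <= 0.0:
        raise ValueError("need 0 < alpha < 1, r_norm > 0 and eps > 0")
    # log base alpha is decreasing, so the inequality flips into a lower bound on H.
    bound = 1.0 + math.log(eps * (1.0 - alpha) / r_norm, alpha)
    return max(1, math.ceil(bound))

H = min_rollout_horizon(alpha=0.4, r_norm=2.0, eps=0.01)
print(H, 0.4 ** (H - 1) * 2.0 / (1.0 - 0.4))  # horizon and the resulting error term
```

Since the bound grows only logarithmically in $1/\epsilon$, halving the tolerated error adds less than one step to the horizon in this example.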

Several papers for MDP problems (with their related cost criteria) based on a one-step policy
improvement idea have reported successful results. For example, Bertsekas and Castanon studied
stochastic scheduling problems [4], Secomandi [34] applied the rollout technique combined with
neuro-dynamic programming [5] to a vehicle routing problem, Ott and Krishnan [30] and Kolarov
and Hui [22] studied network routing problems, Bhulai and Koole [7] considered a multi-server
queueing problem, and Koole and Nain [24] considered a two-class single-server queueing model under a
preemptive priority rule. In particular, [7] and [24] obtained explicit expressions for the value
function of a fixed threshold policy, which plays the role of the heuristic base policy, and showed
numerically that the rollout policy generated from the threshold policy behaves almost optimally.
Chang et al. [10] also showed empirically that the rollout of a fixed threshold policy (Droptail)
performs well for a buffer management problem. Koole [23] also derived the deviation matrix of the
M/M/1/∞ and M/M/1/N queue, which is used for computing the bias vector for a particular
choice of cost function and a certain base policy, from which the rollout policy of the base policy
is generated. Even though the value function of a particular policy can be obtained explicitly for
relatively simple cases from problem structure analysis, calculating the exact value function of a
particular policy is in general very difficult in practice, in which case we apply the receding horizon
rollout policy via simulation. If we have a good heuristic policy, this approach often provides good
performance, improving the performance of the given heuristic policy (see, e.g., [10] for queueing
problems regarding the performance of the receding horizon rollout policy).
4.2  Parallel  rollout

The  rollout  approach  is  promising  if  we  have  a  good  base  policy,  because  the  performance  of  the 
rollout  policy  is  no  worse  than  that  of  the  base  policy.  Note  that  in  practice,  what  we  are  really 
interested  in  is  the  ranking  of  actions,  not  the  degree  of  approximation.  Therefore,  as  long  as  the 
rollout  policy  preserves  the  true  ranking  of  actions  well,  the  resulting  policy  will  perform  fairly  well. 
However,  when  we  have  multiple  policies  available,  because  we  cannot  predict  the  performance  of 
each  policy  in  advance,  selecting  a  particular  single  base  policy  to  be  rolled  out  is  not  an  easy  task. 
Furthermore,  for  some  cases,  each  base  policy  available  is  good  for  different  system  trajectories. 
(See,  for  example,  a  multiclass  scheduling  problem  with  deadlines  discussed  in  [9]  where  the  static 
priority  policy  and  the  earliest  deadline  first  policy  perform  optimally  for  different  paths  of  states.) 
When  this  is  the  case,  we  wish  to  combine  these  base  policies  dynamically  in  an  on-line  manner 
to  generate  a  single  policy  which  adapts  automatically  to  different  trajectories  of  the  system,  in 
addition  to  alleviating  the  difficulty  of  choosing  a  single  base  policy  to  be  rolled  out. 

To  this  end,  we  first  study  a  generalization  of  Howard’s  policy  improvement  scheme  for  multiple 
policies  and  then  consider  a  finite-horizon  approximation  of  the  generalized  scheme. 

Given a finite set $\Lambda \subset \Pi$, suppose that for each $\pi \in \Lambda$, we solved Poisson’s equation with
respect to $\pi$ given by Eq. (2), obtaining a function $h^\pi$ and $J^\pi_\infty$ for $\pi$.

Note that the function $h^\pi$ that satisfies Poisson’s equation with respect to $\pi$ is not necessarily
unique [1, 33]. Under Assumption 2.1, the following function known as the “relative value function”

$$\lim_{\gamma \to 1} \big( V^\pi_\gamma(x) - V^\pi_\gamma(0) \big),$$

where $0$ is an arbitrarily fixed state in $X$ and $V^\pi_\gamma(x)$ is the infinite-horizon discounted reward of
following $\pi$, starting with state $x$, given by

$$E\left[ \sum_{t=0}^{\infty} \gamma^t R(x_t, \pi(x_t)) \,\Big|\, x_0 = x \right],$$

solves Poisson’s equation with respect to $\pi$ [18]. Another important $h^\pi$ that satisfies Poisson’s
equation with respect to $\pi$ is the bias. If there exists a state $0 \in X$ that is reachable from any state
in $X$ in a finite number of time steps by following any fixed stationary policy, then the function
$g^\pi \in B(X)$ defined by

$$g^\pi(x) = \lim_{n \to \infty} \big( V^\pi_n(x) - n J^\pi_\infty \big), \quad x \in X,$$

satisfies Poisson’s equation with respect to $\pi$ and is called the bias [27]. Therefore, $g^\pi$ can be
taken as $h^\pi$. If $X$ is finite and the given MDP is unichain and if we add the condition of

$$\sum_{x \in X} P_\pi(x) h^\pi(x) = 0$$

to Poisson’s equation with respect to $\pi$, the bias is the unique solution to Poisson’s equation
with respect to $\pi$ (see, e.g., [1, 12, 25]). For the relationship between relative value function and bias
and the computation of a bias-optimal policy, see, e.g., [25] for finite state and action spaces with
a unichain assumption. Our discussion in this section will focus on the bias but can be extended
to the relative value function.

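Define a value function $\Phi \in B(X)$ such that

$$\Phi(x) = \max_{\pi \in \Lambda} h^\pi(x)$$

and define a new policy $\bar{\pi}$ such that for all $x \in X$,

$$T_{\bar{\pi}}(\Phi)(x) = T(\Phi)(x)$$

and call $\bar{\pi}$ a “parallel rollout” policy. That is, for all $x \in X$,

$$\bar{\pi}(x) \in \arg\max_{a \in A} \Big\{ R(x,a) + \sum_{y \in X} p(y|x,a) \max_{\pi \in \Lambda} h^\pi(y) \Big\}. \tag{6}$$

Let $\Lambda^* = \arg\max_{\pi \in \Lambda} J^\pi_\infty$. We say that $\delta$ is a gain-optimal policy in $\Lambda$ if $\delta \in \Lambda^*$, and $\delta$ is a
bias-optimal policy in $\Lambda$ if $\delta \in \Lambda^*$ and $h^\delta(x) \ge \max_{\pi \in \Lambda} h^\pi(x)$ for all $x \in X$.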


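Theorem 4.2 Given a finite set $\Lambda \subset \Pi$, suppose that for each $\pi \in \Lambda$, $h^\pi$ satisfies Poisson’s
equation with respect to $\pi$ given by Eq. (2):

$$J^\pi_\infty + h^\pi(x) = T_\pi(h^\pi)(x), \quad x \in X.$$

Consider $\bar{\pi}$ defined in Eq. (6).

a) If there exists a bias-optimal policy in $\Lambda$, then

$$J^{\bar{\pi}}_\infty \ge \max_{\pi \in \Lambda} J^\pi_\infty.$$

For any gain-optimal policy $\delta$ in $\Lambda$,

$$J^{\bar{\pi}}_\infty \ge \max_{\pi \in \Lambda} J^\pi_\infty - \sup_{x \in X} \Big( \max_{\pi \in \Lambda} h^\pi(x) - h^\delta(x) \Big).$$

b)

$$J^{\bar{\pi}}_\infty \ge \sum_{x \in X} J^{\arg\max_{\pi \in \Lambda} h^\pi(x)}_\infty P_{\bar{\pi}}(x).$$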


We  provide  the  proof  of  this  theorem  first  before  we  discuss  how  these  bounds  can  be  interpreted. 
Proof: Observe that for any $x \in X$,

$$\begin{aligned}
T(\Phi)(x) &= \max_{a \in A} \Big\{ R(x,a) + \sum_{y \in X} p(y|x,a) \Phi(y) \Big\} \\
&\ge R(x, \pi(x)) + \sum_{y \in X} p(y|x, \pi(x)) \Phi(y) \quad \text{for any } \pi \in \Lambda \\
&\ge R(x, \pi(x)) + \sum_{y \in X} p(y|x, \pi(x)) h^\pi(y) \quad \text{from the definition of } \Phi \\
&= J^\pi_\infty + h^\pi(x).
\end{aligned}$$

Because the above inequality holds for any $\pi \in \Lambda$, for all $x \in X$,

$$T(\Phi)(x) \ge \max_{\pi \in \Lambda} \big( J^\pi_\infty + h^\pi(x) \big).$$

Now,

$$\begin{aligned}
J^{\bar{\pi}}_\infty &= \sum_{x \in X} [T(\Phi)(x) - \Phi(x)] P_{\bar{\pi}}(x) \quad \text{from Lemma 3.1} \\
&\ge \sum_{x \in X} \Big[ \max_{\pi \in \Lambda} \big( J^\pi_\infty + h^\pi(x) \big) - \max_{\pi \in \Lambda} h^\pi(x) \Big] P_{\bar{\pi}}(x) \quad \text{by the previous observation.}
\end{aligned} \tag{7}$$

For part a), selecting a gain-optimal policy $\delta \in \Lambda$ that achieves $\max_{\pi \in \Lambda} J^\pi_\infty$ yields

$$\begin{aligned}
J^{\bar{\pi}}_\infty &\ge \sum_{x \in X} \Big[ J^\delta_\infty + h^\delta(x) - \max_{\pi \in \Lambda} h^\pi(x) \Big] P_{\bar{\pi}}(x) \\
&= J^\delta_\infty + \sum_{x \in X} \Big[ h^\delta(x) - \max_{\pi \in \Lambda} h^\pi(x) \Big] P_{\bar{\pi}}(x) \\
&\ge \max_{\pi \in \Lambda} J^\pi_\infty - \sup_{x \in X} \Big( \max_{\pi \in \Lambda} h^\pi(x) - h^\delta(x) \Big).
\end{aligned}$$

If $\delta$ is a bias-optimal policy in $\Lambda$, $\sum_{x \in X} [h^\delta(x) - \max_{\pi \in \Lambda} h^\pi(x)] P_{\bar{\pi}}(x) \ge 0$. Thus for this case,

$$J^{\bar{\pi}}_\infty \ge \max_{\pi \in \Lambda} J^\pi_\infty.$$

For part b), at each $x$, selecting a policy in $\Lambda$ that achieves $\max_{\pi \in \Lambda} h^\pi(x)$ in Eq. (7) yields the
desired result with simple algebra. ■

An interpretation of the above theorem is as follows. Part a) of Theorem 4.2 states that the
parallel rollout policy improves the infinite-horizon average reward of a gain-optimal policy in $\Lambda$
up to an error bounded by the maximal difference of the biases achieved by the gain-optimal
policy and the policies in $\Lambda$, and it is guaranteed that the parallel rollout policy improves the
gain of any policy in $\Lambda$ if $\Lambda$ contains at least one bias-optimal policy. Part b) states that the gain
of the parallel rollout policy is no worse than the average gain of the best policy that achieves the
maximal bias value at each state, where the average is taken over the stationary distribution of $\bar{\pi}$
(it can be thought of as an initial distribution over $X$). That is,

$$\sum_{x \in X} \Big( J^{\bar{\pi}}_\infty - J^{\arg\max_{\pi \in \Lambda} h^\pi(x)}_\infty \Big) P_{\bar{\pi}}(x) \ge 0.$$

We now consider a finite-horizon approximation of the parallel rollout control policy within the
framework of the approximate receding horizon control. The direct generalization of the rollout
policy defined in the previous subsection is to replace $\max_{\pi \in \Lambda} h^\pi(x)$, $x \in X$, by the maximum of
the values of the policies in $\Lambda$ for a finite horizon at the state $x$. Formally, we define the $H$-horizon
parallel rollout policy $\pi_{pr,H}$ with a finite set $\Lambda \subset \Pi$ of base policies in $\Pi$ as

$$\pi_{pr,H}(x) \in \arg\max_{a \in A} \Big\{ R(x,a) + \sum_{y \in X} p(y|x,a) \max_{\pi \in \Lambda} V^\pi_{H-1}(y) \Big\}, \quad x \in X.$$

We  can  first  easily  see  that  this  is  based  on  a  more  accurate  lower  bound  of  the  optimal  total 
reward  value  than  that  of  the  17-horizon  rollout  policy  and  if  supxg^  |  rnax^gA  Vjj_  1  (x)  —  Vfi_  1  (x)  |  < 
e,  by  Theorem  3.1,  the  performance  will  be  close  to  that  of  the  true  receding  horizon  approach.  If 
we  view  hA  as  the  relative  value  function  or  the  bias,  in  the  definitions  of  the  rollout  and  parallel 
rollout  policies,  we  don’t  directly  estimate  hA .  One  can  use  a  finite-horizon  approximation  of  IP 
directly.  For  example,  we  could  use  V^_1(x)  —  V^_1(0)  with  a  fixed  state  0£l  instead  of  V^_1(x) 
for  an  approximate  value  of  the  relative  value  function.  The  result  of  Theorem  4.1  still  holds  with 
this  replacement  via  Corollary  2.1  with  the  simple  observation  that 

J~°'H  =  EPW-iX®)  -  VH- 1(°)  -  VH- l(s)  +  VS-AO )}P^H(x). 

X 

The  main  reasons  that  we  use  the  total  reward  value,  not  the  relative  total  reward  value  or  the 
bias,  are  twofold.  First,  we  want  our  approximation  scheme  to  be  within  our  framework  of  the 
approximate  receding  horizon  control,  which  allows  us  to  compare  the  performance  of  the  (parallel) 
rollout  policy  with  the  optimal  infinite-horizon  average  reward.  Second,  we  want  to  keep  the  spirit 
of  the  (parallel)  rollout  policy  defined  for  the  discounted  reward  criterion.  (For  the  infinite-horizon 
discounted  criterion,  we  simply  replace  the  total  reward  of  following  a  policy  in  the  definition  of 
the  (parallel)  rollout  policy  by  the  total  discounted  reward.  See  [9,  10].) 

We  analyze  below  the  performance  of  the  77-horizon  parallel  rollout  policy  compared  with  the 
infinite-horizon  average  rewards  obtained  by  policies  in  A.  To  this  end,  for  any  ir  £  II,  define 
J£(x)  =  Vn  ^  f°r  all  x  £  X  and  n  =  1,2,....  That  is,  this  is  the  n- horizon  approximation  of  the 
infinite-horizon  average  reward.  With  a  similar  argument  as  Platzman’s  given  in  Section  3.3  in  [32], 
we  can  show  that  J^(x )  converges,  uniformly  in  x,  as  0(n-1),  to  J^,  n  =  1,2, .... 


R(x,  a)  +  V  p(y \x,  a)  max  Vg-i (y)  |  ,  x  G  X. 

‘  ^  7rGA 

yex 


(8) 
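The definition in Eq. (8) can be sketched on a small finite MDP as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the two-state MDP (`P`, `R`), the base policies, the terminal condition $V_0^{\pi} = 0$, and all function names are hypothetical, and the finite-horizon values are computed exactly instead of by simulation.

```python
# Hypothetical 2-state, 2-action MDP (illustration only, not from the paper).
# P[x][a][y] = p(y | x, a); R[x][a] = one-step reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
R = [
    [1.0, 0.0],
    [0.0, 2.0],
]
STATES = range(2)
ACTIONS = range(2)


def finite_horizon_value(policy, horizon):
    """V_n^pi for n = horizon, assuming V_0^pi = 0 (an assumption of this sketch)."""
    v = [0.0 for _ in STATES]
    for _ in range(horizon):
        v = [
            R[x][policy[x]] + sum(P[x][policy[x]][y] * v[y] for y in STATES)
            for x in STATES
        ]
    return v


def q_values(x, phi):
    """One-step lookahead values R(x,a) + sum_y p(y|x,a) phi(y) for every action a."""
    return [R[x][a] + sum(P[x][a][y] * phi[y] for y in STATES) for a in ACTIONS]


def bellman_backup(phi):
    """T(phi)(x) = max_a { R(x,a) + sum_y p(y|x,a) phi(y) }."""
    return [max(q_values(x, phi)) for x in STATES]


def parallel_rollout_policy(base_policies, horizon):
    """Greedy policy of Eq. (8) with phi(y) = max over base policies of V_{H-1}^pi(y)."""
    values = [finite_horizon_value(pi, horizon - 1) for pi in base_policies]
    phi = [max(v[y] for v in values) for y in STATES]
    policy = []
    for x in STATES:
        q = q_values(x, phi)
        policy.append(max(ACTIONS, key=q.__getitem__))  # ties go to the lowest index
    return policy, phi


if __name__ == "__main__":
    base = [[0, 0], [1, 1]]  # two stationary base policies, one action per state
    policy, phi = parallel_rollout_policy(base, horizon=5)
    print("parallel rollout actions:", policy)
```

In this sketch `bellman_backup(phi)` dominates the $H$-horizon value of every base policy at every state, which is the property the analysis below relies on.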


Theorem 4.3 Assume that Assumption 2.1 holds. Consider the $H$-horizon parallel rollout policy $\pi_{pr,H}$ with a finite set $\Lambda \subset \Pi$ and $H < \infty$. Then

$$
J_\infty^{\pi_{pr,H}} \ge \sum_{x \in X} J_\infty^{\arg\max_{\pi \in \Lambda} J_{H-1}^{\pi}(x)} P^{\pi_{pr,H}}(x) - \frac{M}{1-\alpha}\,\alpha^{H-1}.
$$


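Proof: From Corollary 2.1, for all $x \in X$ and any $\pi \in \Lambda$,

$$
V_H^{\pi}(x) - V_{H-1}^{\pi}(x) \ge J_\infty^{\pi} - \frac{M}{1-\alpha}\,\alpha^{H-1}.
$$

Define $\Phi \in B(X)$ such that $\Phi(x) = \max_{\pi \in \Lambda} V_{H-1}^{\pi}(x)$ for all $x \in X$. Observe that for all $x \in X$,

$$
\begin{aligned}
T(\Phi)(x) &= \max_{a \in A} \Big\{ R(x,a) + \sum_{y \in X} p(y|x,a)\,\Phi(y) \Big\} \\
&\ge R(x,\pi(x)) + \sum_{y \in X} p(y|x,\pi(x))\,\Phi(y) \quad \text{for any } \pi \in \Lambda \\
&\ge R(x,\pi(x)) + \sum_{y \in X} p(y|x,\pi(x))\,V_{H-1}^{\pi}(y) = V_H^{\pi}(x).
\end{aligned}
$$

Therefore, for all $x \in X$,

$$
T(\Phi)(x) \ge \max_{\pi \in \Lambda} V_H^{\pi}(x). \qquad (9)
$$

Now,

$$
\begin{aligned}
J_\infty^{\pi_{pr,H}} &= \sum_{x} \big[ T(\Phi)(x) - \Phi(x) \big] P^{\pi_{pr,H}}(x) \quad \text{by Lemma 3.1} \\
&\ge \sum_{x} \Big[ \max_{\pi \in \Lambda} V_H^{\pi}(x) - \max_{\pi \in \Lambda} V_{H-1}^{\pi}(x) \Big] P^{\pi_{pr,H}}(x) \quad \text{by Eq. (9)} \\
&\ge \sum_{x} J_\infty^{\arg\max_{\pi \in \Lambda} J_{H-1}^{\pi}(x)} P^{\pi_{pr,H}}(x) - \frac{M}{1-\alpha}\,\alpha^{H-1} \quad \text{by Corollary 2.1.}
\end{aligned}
$$

The last inequality applies Corollary 2.1 at each $x$ to a policy $\pi_x \in \arg\max_{\pi \in \Lambda} J_{H-1}^{\pi}(x)$, which also maximizes $V_{H-1}^{\pi}(x)$ over $\Lambda$, so that $\max_{\pi \in \Lambda} V_H^{\pi}(x) - \max_{\pi \in \Lambda} V_{H-1}^{\pi}(x) \ge V_H^{\pi_x}(x) - V_{H-1}^{\pi_x}(x)$.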


As we increase $H$, the term with $\alpha$ decreases geometrically fast in $\alpha$, and $J_{H-1}^{\pi}(x)$, $x \in X$, approaches $J_\infty^{\pi}$ in $O(H^{-1})$. An interpretation of this result is as follows. The infinite-horizon average reward obtained by following the $H$-horizon parallel rollout policy is no worse than that obtained by following the best policy in $\Lambda$ that achieves the maximum $H$-horizon average reward associated with a starting state $x$ among the policies in $\Lambda$, where the distribution of the starting state is given by the stationary distribution of following the parallel rollout policy.
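The $O(H^{-1})$ rate mentioned above can be checked numerically on a toy example. The sketch below assumes a hypothetical two-state Markov reward chain (the chain `P`, the rewards `r`, and the function names are illustrative, not from the paper) and compares $J_n^{\pi}(x) = V_n^{\pi}(x)/n$ with the stationary gain, again taking $V_0^{\pi} = 0$.

```python
# Hypothetical two-state Markov reward chain induced by some fixed policy.
P = [[0.9, 0.1], [0.2, 0.8]]  # P[x][y] = transition probability from x to y
r = [1.0, 0.0]                # r[x] = one-step reward in state x


def total_reward(n):
    """V_n(x): expected total reward over n steps, with V_0 = 0 (assumed)."""
    v = [0.0, 0.0]
    for _ in range(n):
        v = [r[x] + sum(P[x][y] * v[y] for y in range(2)) for x in range(2)]
    return v


def stationary_gain():
    """Average reward under the stationary distribution of a two-state chain."""
    a, b = P[0][1], P[1][0]
    dist = [b / (a + b), a / (a + b)]
    return sum(dist[x] * r[x] for x in range(2))


def max_error(n):
    """sup_x |V_n(x)/n - gain|, the error of the n-horizon average reward."""
    gain = stationary_gain()
    return max(abs(v / n - gain) for v in total_reward(n))


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        # n * error stays bounded, which is what an O(1/n) rate means.
        print(n, max_error(n), n * max_error(n))
```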

We conclude this section with a brief discussion of some intuition behind the parallel rollout policy. Consider a policy $\phi_H$ that selects the action given by the policy $\pi$ in $\Lambda$ that has the highest $J_H^{\pi}(x)$ estimate at the current state $x$. That is, at state $x$, $\phi_H$ takes an action given by

$$
\Big( \arg\max_{\pi \in \Lambda} J_H^{\pi}(x) \Big)(x).
$$

Note that this policy will converge to the policy $\arg\max_{\pi \in \Lambda} J_\infty^{\pi}$ in $O(H^{-1})$. However, this receding horizon policy, $\phi_H$, selects only an action prescribed by some $\pi \in \Lambda$. In other words, the policy does not give enough emphasis and freedom in the evaluation to the initial action (this drawback has been empirically shown in [10]). Therefore, we conjecture informally that this policy is generally suboptimal, even though we can expect that $\phi_H$ is much more uniformly reasonable (across the state space) than any single base policy in $\Lambda$. On the other hand, the receding horizon rollout policy evaluates each possible initial action based on one-step lookahead relative to a base policy being improved. We can view the parallel rollout technique as a method of capturing the spirit of rolling out $\phi_H$ at a low cost (see also [10] for a similar discussion under the discounted reward criterion).
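The contrast between $\phi_H$ and the parallel rollout policy can be made concrete with a small sketch. Everything here is an assumption made for illustration: the MDP has a third action that no base policy ever uses, the names are hypothetical, and values are computed exactly with $V_0^{\pi} = 0$. Since $J_H^{\pi}(x) = V_H^{\pi}(x)/H$, maximizing $V_H^{\pi}(x)$ over $\Lambda$ selects the same base policy.

```python
# Hypothetical MDP: 2 states, 3 actions. Base policies only use actions 0 and 1.
# P[x][a][y] = p(y | x, a); R[x][a] = one-step reward.
P = [
    [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]],
]
R = [
    [1.0, 0.0, 10.0],  # action 2 is very rewarding at state 0
    [0.0, 1.0, 0.0],
]
STATES = range(2)
ACTIONS = range(3)
BASE_POLICIES = [[0, 0], [1, 1]]


def finite_horizon_value(policy, horizon):
    """V_n^pi for n = horizon, with V_0^pi = 0 (assumed)."""
    v = [0.0 for _ in STATES]
    for _ in range(horizon):
        v = [
            R[x][policy[x]] + sum(P[x][policy[x]][y] * v[y] for y in STATES)
            for x in STATES
        ]
    return v


def follow_best_base_policy(horizon):
    """phi_H: at each x, take the action of the base policy with the largest V_H^pi(x)."""
    values = [finite_horizon_value(pi, horizon) for pi in BASE_POLICIES]
    actions = []
    for x in STATES:
        best = max(range(len(BASE_POLICIES)), key=lambda i: values[i][x])
        actions.append(BASE_POLICIES[best][x])
    return actions


def parallel_rollout(horizon):
    """One-step lookahead over all actions on max over base policies of V_{H-1}^pi."""
    values = [finite_horizon_value(pi, horizon - 1) for pi in BASE_POLICIES]
    phi = [max(v[y] for v in values) for y in STATES]
    actions = []
    for x in STATES:
        q = [R[x][a] + sum(P[x][a][y] * phi[y] for y in STATES) for a in ACTIONS]
        actions.append(max(ACTIONS, key=q.__getitem__))
    return actions


if __name__ == "__main__":
    print("phi_H actions:          ", follow_best_base_policy(4))
    print("parallel rollout actions:", parallel_rollout(4))
```

In this toy case $\phi_H$ can only return actions 0 or 1, while the one-step lookahead is free to pick action 2 at state 0.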

5  Concluding  Remarks 

When  we  simulate  a  base  policy  by  Monte-Carlo  simulation,  using  different  sets  of  random  number 
sequences  (different  sample-paths)  across  actions  increases  the  variance  in  the  utility  (Q-value) 
measure.  Therefore,  we  suggest  using  the  same  set  of  random  number  sequences  across  actions. 
This  has  the  same  flavor  as  the  differential  training  method  [3]  and  common  random  number 
simulation  in  the  discrete  event  systems  literature  [19]. 
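A minimal sketch of the common random number idea is given below, under stated assumptions: the MDP, the base policy, the seeding scheme, and the function names are all hypothetical. At state 0 the two actions are deliberately given the same transition kernel and rewards that differ by 0.5, so that with shared seeds the estimated Q-value difference has no sampling noise at all.

```python
import random

# Hypothetical MDP: P[x][a][y] = p(y | x, a); R[x][a] = one-step reward.
P = [
    [[0.9, 0.1], [0.9, 0.1]],  # at state 0 both actions share the same kernel
    [[0.3, 0.7], [0.6, 0.4]],
]
R = [
    [1.0, 1.5],
    [0.0, 2.0],
]
BASE_POLICY = [0, 1]  # action taken by the base policy in each state


def step(x, a, rng):
    """Sample y ~ p(. | x, a) by inverse transform from a single uniform draw."""
    u, cum = rng.random(), 0.0
    for y, prob in enumerate(P[x][a]):
        cum += prob
        if u < cum:
            return y
    return len(P[x][a]) - 1


def rollout_return(x, a, horizon, rng):
    """Total reward of taking `a` at `x`, then following BASE_POLICY for horizon-1 steps."""
    total = R[x][a]
    y = step(x, a, rng)
    for _ in range(horizon - 1):
        b = BASE_POLICY[y]
        total += R[y][b]
        y = step(y, b, rng)
    return total


def estimate_q(x, horizon, seeds, common=True):
    """Monte-Carlo Q-value estimates at x; `common` reuses the same seeds across actions."""
    seeds = list(seeds)
    q = []
    for a in range(len(P[x])):
        offset = 0 if common else 1000 * (a + 1)  # independent streams per action
        samples = [
            rollout_return(x, a, horizon, random.Random(s + offset)) for s in seeds
        ]
        q.append(sum(samples) / len(samples))
    return q


if __name__ == "__main__":
    print("common seeds:     ", estimate_q(0, 10, range(200), common=True))
    print("independent seeds:", estimate_q(0, 10, range(200), common=False))
```

With `common=True` every sample path after the first action coincides across the two actions, so the estimated difference equals the reward gap; with independent streams the difference also carries sampling noise.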

There are several papers in the literature on the simulation-based policy iteration method, where the policy evaluation step is done via simulation. Rather than estimating a finite-horizon total-reward value of a policy, those papers consider approximating $h^{\pi}$ directly. For example, Cooper et al. [11] use a sampling method called “coupling-from-the-past” that requires obtaining a sample from the stationary distribution of the (aperiodic) Markov chain generated by a fixed policy, and He et al. [15] use a temporal-difference learning scheme to estimate the bias of the policy. Both papers assume finite state and action spaces and a unichain condition. On the other hand, Bertsekas and Tsitsiklis [5] discuss estimating $h^{\pi}$, defined as the relative value function, via Monte-Carlo simulation.